emp_title: Job title.
emp_length: Number of years in the job, rounded down. If longer than 10 years, then this is represented by the value 10.
state: Two-letter state code.
home_ownership: The ownership status of the applicant's residence.
annual_income: Annual income.
verified_income: Type of verification of the applicant's income.
debt_to_income: Debt-to-income ratio.
annual_income_joint: If this is a joint application, then the annual income of the two parties applying.
verification_income_joint: Type of verification of the joint income.
debt_to_income_joint: Debt-to-income ratio for the two parties.
delinq_2y: Delinquencies on lines of credit in the last 2 years.
months_since_last_delinq: Months since the last delinquency.
earliest_credit_line: Year of the applicant's earliest line of credit.
inquiries_last_12m: Inquiries into the applicant's credit during the last 12 months.
total_credit_lines: Total number of credit lines in this applicant's credit history.
open_credit_lines: Number of currently open lines of credit.
total_credit_limit: Total available credit, e.g. if only credit cards, then the total of all the credit limits. This excludes a mortgage.
total_credit_utilized: Total credit balance, excluding a mortgage.
num_collections_last_12m: Number of collections in the last 12 months. This excludes medical collections.
num_historical_failed_to_pay: The number of derogatory public records, which roughly means the number of times the applicant failed to pay.
months_since_90d_late: Months since the last time the applicant was 90 days late on a payment.
current_accounts_delinq: Number of accounts where the applicant is currently delinquent.
total_collection_amount_ever: The total amount that the applicant has had against them in collections.
current_installment_accounts: Number of installment accounts, which are (roughly) accounts with a fixed payment amount and period. A typical example might be a 36-month car loan.
accounts_opened_24m: Number of new lines of credit opened in the last 24 months.
months_since_last_credit_inquiry: Number of months since the last credit inquiry on this applicant.
num_satisfactory_accounts: Number of satisfactory accounts.
num_accounts_120d_past_due: Number of current accounts that are 120 days past due.
num_accounts_30d_past_due: Number of current accounts that are 30 days past due.
num_active_debit_accounts: Number of currently active bank cards.
total_debit_limit: Total of all bank card limits.
num_total_cc_accounts: Total number of credit card accounts in the applicant's history.
num_open_cc_accounts: Total number of currently open credit card accounts.
num_cc_carrying_balance: Number of credit cards that are carrying a balance.
num_mort_accounts: Number of mortgage accounts.
account_never_delinq_percent: Percent of all lines of credit where the applicant was never delinquent.
tax_liens: Number of tax liens.
public_record_bankrupt: Number of bankruptcies listed in the public record for this applicant.
loan_purpose: The category for the purpose of the loan.
application_type: The type of application: either individual or joint.
loan_amount: The amount of the loan the applicant received.
term: The number of months of the loan the applicant received.
interest_rate: Interest rate of the loan the applicant received.
installment: Monthly payment for the loan the applicant received.
grade: Grade associated with the loan.
sub_grade: Detailed grade associated with the loan.
issue_month: Month the loan was issued.
loan_status: Status of the loan.
initial_listing_status: Initial listing status of the loan. (I think this has to do with whether the lender provided the entire loan or if the loan is across multiple lenders.)
disbursement_method: Disbursement method of the loan.
balance: Current balance on the loan.
paid_total: Total that has been paid on the loan by the applicant.
paid_principal: The difference between the original loan amount and the current balance on the loan.
paid_interest: The amount of interest paid so far by the applicant.
paid_late_fees: Late fees paid by the applicant.
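The definition of paid_principal above (original loan amount minus current balance) can be checked row by row. The following is a minimal sketch on made-up rows, not on the real data; the helper name principal_mismatches and the tolerance are assumptions introduced for illustration.

```python
# Made-up rows for illustration; column names follow the data dictionary above.
loans = [
    {"loan_amount": 10000, "balance": 7500.0, "paid_principal": 2500.0},
    {"loan_amount": 5000, "balance": 5000.0, "paid_principal": 0.0},
    # Deliberately inconsistent row: 20000 - 12000 is 8000, not 7000.
    {"loan_amount": 20000, "balance": 12000.0, "paid_principal": 7000.0},
]

def principal_mismatches(rows, tol=0.01):
    """Return indices of rows where paid_principal differs from
    loan_amount - balance by more than tol (hypothetical helper)."""
    return [
        i for i, row in enumerate(rows)
        if abs((row["loan_amount"] - row["balance"]) - row["paid_principal"]) > tol
    ]

print(principal_mismatches(loans))
```

Rows flagged by such a check would be worth inspecting before relying on the payment columns.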
It is difficult to understand the various factors that determine the grade assigned to a borrower. Let's visualize how each factor contributes to determining the interest rate.
The location of a given loan is another factor to consider when making investment decisions. Each local market is different and may affect how well the completed property will sell and how much it will fetch. Lower-risk projects tend to be located in areas with strong real estate markets.
Additionally, location heavily influences the ease of foreclosure in the event it is necessary, which may be a factor worth considering for some investors. This is especially true for mortgage loans.
Surprisingly, the state with the highest overall interest rate is Hawaii, which actually has the lowest mortgage interest rate. This dataset also includes personal loans, which may be why Hawaii shows a higher interest rate here. We would need to filter the records by loan type, among other things, to gain more insight.
We have to refer to state-specific laws and the interest rates set by the Federal Reserve to keep track of the interest rate accurately.
Here, we can see which state has the highest interest rate. Although the interest rate depends on various factors, the main criteria may be the state's per capita income, poverty level, taxes, and tax system.
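The state comparison described above amounts to averaging interest_rate within each state. A minimal sketch follows, using made-up rows rather than the real data; the function name mean_rate_by_state is an assumption. It also reports how many loans each state contributes, since a state with very few loans can show an extreme average.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; real values come from the dataset.
loans = [
    {"state": "HI", "interest_rate": 14.0},
    {"state": "HI", "interest_rate": 16.0},
    {"state": "CA", "interest_rate": 10.0},
    {"state": "CA", "interest_rate": 12.0},
    {"state": "CA", "interest_rate": 11.0},
    {"state": "TX", "interest_rate": 13.0},
]

def mean_rate_by_state(rows):
    """Return (state, mean_rate, n_loans) tuples, highest mean rate first."""
    rates = defaultdict(list)
    for row in rows:
        rates[row["state"]].append(row["interest_rate"])
    summary = [(state, mean(values), len(values)) for state, values in rates.items()]
    return sorted(summary, key=lambda item: item[1], reverse=True)

for state, rate, n in mean_rate_by_state(loans):
    print(f"{state}: mean rate {rate:.2f}% over {n} loans")
```

Filtering the rows by loan_purpose before calling the function would be one way to separate out loan types, as suggested above.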
Loans can be assigned one of seven letter grades from A to G, and each grade generally reflects the overall risk of the loan. For example, Grade A loans generally have lower expected returns, lower expected loan losses, and corresponding lower interest payments; whereas on the other end of the spectrum, Grade G loans have higher expected returns, higher potential loan losses, but correspondingly higher interest rates. With Groundfloor, you create a custom portfolio of real estate investments based on your own investment criteria and risk tolerances.
The plot above shows that the interest rate rises as the grade associated with the loan drops. This relationship looks accurate.
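The relationship between grade and interest rate can also be checked numerically. This is a small sketch on made-up rows, assuming grades sort alphabetically from best (A) to worst (G); the helper names are illustrative and not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only.
loans = [
    {"grade": "A", "interest_rate": 6.0},
    {"grade": "A", "interest_rate": 7.0},
    {"grade": "B", "interest_rate": 10.0},
    {"grade": "C", "interest_rate": 14.0},
    {"grade": "C", "interest_rate": 15.0},
]

def mean_rate_by_grade(rows):
    """Return {grade: mean interest rate}, with grades in alphabetical order."""
    rates = defaultdict(list)
    for row in rows:
        rates[row["grade"]].append(row["interest_rate"])
    return {grade: mean(rates[grade]) for grade in sorted(rates)}

def is_strictly_increasing(values):
    """True when each value is larger than the one before it."""
    values = list(values)
    return all(a < b for a, b in zip(values, values[1:]))

by_grade = mean_rate_by_grade(loans)
print(by_grade)
print(is_strictly_increasing(by_grade.values()))
```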
As the interest rate depends mostly on the grade of the loan, we can start analyzing the relation between the assigned grades and the other variables.
Moreover, we also noticed that the interest rate does not appear to depend on any other variable in the dataset. In practice, grades are assigned between A1 and G5, and our dataset follows that format, which is a good thing. The grade is determined from the FICO score, which we are not given. Assuming the grades were calculated from the FICO score, we can still take it into consideration indirectly: a portion of the FICO score is determined by credit utilization, which can be estimated from the credit used and the credit limit.
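Credit utilization as described above is the credit used divided by the credit limit, which maps onto total_credit_utilized and total_credit_limit from the data dictionary. A minimal sketch, assuming that rows with a missing or non-positive limit should yield no ratio (that handling is an assumption, not something the source specifies):

```python
def credit_utilization(total_credit_utilized, total_credit_limit):
    """Return utilized / limit, or None when either value is missing
    or the limit is not positive (assumed handling for bad rows)."""
    if total_credit_utilized is None or total_credit_limit is None:
        return None
    if total_credit_limit <= 0:
        return None
    return total_credit_utilized / total_credit_limit

print(credit_utilization(2500, 10000))
print(credit_utilization(500, 0))
```

Note that this ratio is only a proxy: the actual FICO score is not in the dataset and uses more inputs than utilization alone.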
We can plot credit utilization for further analysis. The credit utilization rate is considered highly predictive of future repayment risk, so it is often an important factor in a person's score. Generally speaking, the higher the utilization rate, the greater the risk of defaulting on a credit account within the next two years.
Here, we can see that credit utilization plays a vital role in determining the FICO score, which is used to assign grades to borrowers. As credit utilization rises, the assigned grade gets lower. From the chart above, it is safe to say that the 0.3 credit utilization rule of thumb holds for this dataset as well: credit utilization above 0.3 can bring the grade down.
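The comparison against the 0.3 threshold can be sketched as a per-sub-grade average of the utilization ratio. The rows below are made up, and the function name is an assumption; rows without a usable credit limit are skipped.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 0.3  # rule-of-thumb cutoff mentioned in the text

# Made-up rows for illustration only.
loans = [
    {"sub_grade": "A1", "total_credit_utilized": 1250, "total_credit_limit": 10000},
    {"sub_grade": "A1", "total_credit_utilized": 2500, "total_credit_limit": 10000},
    {"sub_grade": "D3", "total_credit_utilized": 5000, "total_credit_limit": 10000},
    {"sub_grade": "D3", "total_credit_utilized": 7500, "total_credit_limit": 10000},
]

def mean_utilization_by_sub_grade(rows):
    """Return {sub_grade: mean utilization ratio}, sorted by sub-grade."""
    ratios = defaultdict(list)
    for row in rows:
        limit = row["total_credit_limit"]
        if limit and limit > 0:  # skip rows with no usable limit
            ratios[row["sub_grade"]].append(row["total_credit_utilized"] / limit)
    return {sg: mean(ratios[sg]) for sg in sorted(ratios)}

for sub_grade, util in mean_utilization_by_sub_grade(loans).items():
    flag = "above" if util > THRESHOLD else "at or below"
    print(f"{sub_grade}: {util:.3f} ({flag} {THRESHOLD})")
```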
One outlier here is sub-grade G4, which has low credit utilization. There is only one record with sub-grade G4, which is why we can treat it as an outlier.
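Sub-grades backed by very few records, like G4 here, can be found by counting rows per sub-grade before trusting any per-group average. A small sketch on made-up labels; the helper name sparse_groups and the min_count cutoff are assumptions.

```python
from collections import Counter

# Made-up sub_grade column for illustration only.
sub_grades = ["A1", "A1", "A1", "B2", "B2", "G4"]

def sparse_groups(labels, min_count=2):
    """Return the labels that occur fewer than min_count times, sorted."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_count)

print(sparse_groups(sub_grades))
```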
The loan amount also brings the grade down, since it is judged against the credit limit, among other things: when we ask to borrow more money relative to our credit limit, the grade can go down. An important thing to note is that the distribution is quite dispersed for each grade. For grade G, however, the total loan amount is quite high, which suggests that a bad credit score combined with a large loan amount makes a lower grade likely.
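The spread of loan amounts within each grade can be summarized with a few order statistics. This sketch uses made-up rows and an illustrative function name; it reports the count, minimum, median, and maximum per grade so that dispersion and group size are visible together.

```python
from collections import defaultdict
from statistics import median

# Made-up rows for illustration only.
loans = [
    {"grade": "A", "loan_amount": 5000},
    {"grade": "A", "loan_amount": 10000},
    {"grade": "A", "loan_amount": 30000},
    {"grade": "G", "loan_amount": 25000},
    {"grade": "G", "loan_amount": 35000},
]

def loan_amount_summary(rows):
    """Return {grade: {"n", "min", "median", "max"}} for loan_amount."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["grade"]].append(row["loan_amount"])
    return {
        grade: {
            "n": len(values),
            "min": min(values),
            "median": median(values),
            "max": max(values),
        }
        for grade, values in sorted(amounts.items())
    }

for grade, stats in loan_amount_summary(loans).items():
    print(grade, stats)
```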
The major issue with the dataset is that many of the relations between variables do not make sense when analyzed. It appears to be either a dummy dataset or one that was refined a lot. Many relations between two variables are uniformly distributed.
The emp_title column contains some typos, and it could have been grouped into fewer roles. We have more than 4000 unique roles, which makes it difficult to analyze them all at once and build valid visualizations. For example, we have values such as "6th grade teacher" and "teacher", a distinction we don't need except in special cases.
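Grouping emp_title into fewer roles could be done by normalizing the text and mapping keywords to broad categories. The keyword table below is purely hypothetical and would need to be built from the actual titles; the sketch only shows the mechanism.

```python
import re

# Hypothetical keyword-to-role table; not derived from the dataset.
ROLE_KEYWORDS = {
    "teacher": "teacher",
    "nurse": "nurse",
    "driver": "driver",
}

def normalize_title(title):
    """Lowercase and trim a job title, then map it to a broad role
    when one of the (hypothetical) keywords appears in it."""
    if title is None:
        return None
    cleaned = re.sub(r"\s+", " ", title.strip().lower())
    if not cleaned:
        return None
    for keyword, role in ROLE_KEYWORDS.items():
        if keyword in cleaned:
            return role
    return cleaned  # no keyword matched: keep the cleaned title

print(normalize_title("6th grade teacher"))
print(normalize_title("  Software   Engineer "))
```

Simple substring matching will not fix typos in the titles; those would need a separate correction step.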
Some rows have an empty emp_title, so we can drop those rows to avoid making a wrong decision based on them. Missing emp_length values can be handled based on the emp_title.
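One possible reading of this cleaning step is sketched below, under stated assumptions: rows with an empty emp_title are dropped, and a missing emp_length is filled with the median emp_length of other rows sharing the same emp_title. The median fill is an assumption made for illustration, not a method confirmed by the text, and the rows are made up.

```python
from collections import defaultdict
from statistics import median

# Made-up rows for illustration only; None marks a missing value.
rows = [
    {"emp_title": "teacher", "emp_length": 5},
    {"emp_title": "teacher", "emp_length": 7},
    {"emp_title": "teacher", "emp_length": None},
    {"emp_title": None, "emp_length": 3},
    {"emp_title": "  ", "emp_length": 2},
    {"emp_title": "nurse", "emp_length": None},
]

def clean_employment(records):
    """Drop rows with an empty emp_title, then fill missing emp_length
    with the per-title median where one is available (assumed strategy)."""
    kept = [dict(r) for r in records if r["emp_title"] and r["emp_title"].strip()]
    known = defaultdict(list)
    for r in kept:
        if r["emp_length"] is not None:
            known[r["emp_title"]].append(r["emp_length"])
    for r in kept:
        if r["emp_length"] is None and known[r["emp_title"]]:
            r["emp_length"] = median(known[r["emp_title"]])
    return kept

for r in clean_employment(rows):
    print(r)
```

Titles with no known emp_length at all, like "nurse" in the sample, stay missing rather than being guessed.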